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Abstract 



Several emerging petascale architectures use energy-efficient processors with vectorized 
computational units and in-order thread processing. On these architectures the sustained 
performance of streaming numerical kernels, ubiquitous in the solution of partial differential 
equations, represents a challenge despite the regularity of memory access. Sophisticated 
optimization techniques are required to fully utilize the Central Processing Unit (CPU). 

We propose a new method for constructing streaming numerical kernels using a high- 
level assembly synthesis and optimization framework. We describe an implementation of 
this method in Python targeting the IBM Blue Gene/P supercomputer's PowerPC 450 core. 
This paper details the high-level design, construction, simulation, verification, and analysis 
of these kernels utilizing a subset of the CPU's instruction set. 

We demonstrate the effectiveness of our approach by implementing several three-dimensional 
stencil kernels over a variety of cached memory scenarios and analyzing the mechanically 
scheduled variants, including a 27-point stencil achieving a 1.7x speedup over the best pre- 
viously published results. 

keywords: High Performance Computing, Performance Optimization, Code Generation, 
SIMD, Blue Gene/P 



1 Introduction 



1.1 Motivation 

As computational science soars past the petascale to exascale, a large number of applications 
continue to achieve disappointingly small fractions of the sustained performance capability 
of emerging architectures. In many cases this shortcoming in performance stems from issues 
at the single processor or thread level. The 'many-core' revolution brings simpler, slower, 
more power-efficient cores with vectorized floating point units in large numbers to a sin- 
gle processor. Performance improvements on such systems should be multiplicative with 
improvements derived from processor scaling. 

The characteristics of Blue Gene/P that motivate this work will persist in the processor 
cores of exascale systems, as one of the fundamental challenges of the exascale relative to 
petascale is electrical power [T3112S]. An optimistic benchmark (goal) for a petascale system 
is the continuous consumption of about a Mega Watt of electrical power, which represents 
the average continuous power consumption of roughly one thousand people in an OECD 
country (1.4 kW per person). Power reductions relative to delivered flop/s of factors of 
one to two orders of magnitude are expected en route to the exascale, which means more 
threads instead of faster-executing threads, power growing roughly as the cube of the clock 
frequency. It also means much less memory and memory bandwidth per thread because the 
movement of data over copper interconnects consumes much more power than the operations 
on data in registers. Mathematical formulations of problems and algorithms to implement 
them will be rewritten to increase arithmetic intensity in order to avoid data movement. 
Implementations in hardware will have to be made without some of today's popular power- 
intensive performance optimizations. 

At each stage of such a design, tradeoffs are made that complicate performance optimiza- 
tion for the compiler and programmer in exchange for improved power and die efficiency from 
the hardware. For example, out-of-order execution logic allows a microprocessor to reorder 
instruction execution on the fly to avoid pipeline and data hazards. When out-of-order 
logic is removed to save silicon and power, the responsibility for avoiding these hazards is 
returned to the compiler and programmer. In the same vein, a wide vector processing unit 
can significantly augment the floating point performance capabilities of a processing core 
at the expense of the efficient single-element mappings between input and output in scalar 
algorithms. Such vector units also incur greater bandwidth demands for a given level of per- 
formance, as measured by percentage of theoretical peak. Graphics processing units provide 
a set of compute semantics similar to a traditional Single Instruction Multiple Data (SIMD) 
vector processing unit with Single Instruction Multiple Thread (SIMT), but still execute 
in-order and require vectorized instruction interlacing to achieve optimal performance. We 
observe a broad trend to improve efficiency of performance with wider vector units and in- 
order execution units in the architectures of the IBM Blue Gene/P PowerPC 450 [38J, the 
Cell Broadband Engine Architecture [TO], Intel's MIC architecture [27], and NVIDIA's Tesla 
Graphics Processing Unit (GPU) [29J. 

In general terms, we may expect less memory per thread, less memory bandwidth per 
thread, and more threads per fixed data set size, creating an emphasis on strong scaling 
within a shared-memory unit. We also foresee larger grain sizes of SIMDization and high 



penalization of reads and writes from main memory as we move towards exascale. 
1.2 Background 

We define streaming numerical kernels as small, cycle-intensive regions of a program where, 
for a given n bytes of data accessed in a sequential fashion, 0(n) computational operations 
are required. Streaming numerical kernels are generally considered to be memory-bandwidth 
bound on most common architectures due to their low arithmetic intensity. The actual per- 
formance picture is substantially more complicated on high performance computing micro- 
processors, with constraints on computational performance stemming from such disparate 
sources as software limitations in the expressiveness of standard C and Fortran when target- 
ing SIMD processors and a host of hardware bottlenecks and constraints, from the number 
of available floating point registers, to the available instructions for streaming memory into 
SIMD registers, to the latency and throughput of buses between the multiple levels of the 
memory hierarchy. 

In this paper we focus upon stencil operators, a subset of streaming kernels that define 
computations performed over a local neighborhood of points in a spatial multi-dimensional 
grid. Stencil operators are commonly found in partial differential equation solver codes in 
the role of finite-difference discretizations of continuous differential operators. Perhaps the 
most well-known of these is the 7-point stencil, which usually arises as a finite difference dis- 
cretization of the Laplace kernel on structured grids. Although adaptive numerical methods 
and discretization schemes have diminished this operator's relevance for many problems as 
a full-fledged numerical solver, it is still a cycle-intensive subcomponent in several impor- 
tant scientific applications such as Krylov iterations of Poisson terms in pressure corrections, 
gravitation, electrostatics, and wave propagation on uniform grids, as well as block on adap- 
tive mesh refinement methods [4j. We also target the 7-point stencil's "boxier" cousin, 
the 27-point stencil. The 27-point stencil arises when cross-terms are needed such as in the 
NASA Advanced Supercomputing (NAS) parallel Multi Grid (MG) benchmark, which solves 
a Poisson problem using a V-cycle multigrid method with the stencil operator [3]. Finally, 
we examine the 3-point stencil, the one-dimensional analogue to the 7-point stencil and an 
important sub-kernel in our analysis. 

We concentrate on the Blue Gene/P architecture for a number of reasons. The forward- 
looking design of the Blue Gene series, with its power-saving and ultra-scaling properties 
exemplifies some of the characteristics that will be common in exascale systems. Indeed, 
successor BlueGene/Q now tops the GREEN500 [16] list. Blue Gene/P has SIMD registers, a 
slow clock rate (850 MHz), an in-order, narrow superscalar execution path, and is constructed 
to be highly reliable and power-efficient, holding 15 of the top 25 slots of the GREEN500 
list as recently as November 2009. Blue Gene continues to generate cutting-edge science, as 
evidenced by the continued presence of Blue Gene systems as winners and finalists in the 
Gordon Bell competition [DEUCEH]. 

In the work presented here our simulations and performance enhancements focus on 
using code synthesis and scheduling to increase arithmetic intensity with unroll-and-jam, 
an optimization technique that creates tiling on multiply nested loops through a two-step 
procedure as in (5j E]. Unroll-and-jam combines two well-known techniques, loop unrolling 
on the outer loops to create multiple inner loops, then loop fusion, or "jamming," to combine 



the inner loops into a single loop. Figure [T] shows an example of unroll-and-jam applied to 
a copy operation between two three-dimensional arrays, where each of the two outer most 
loops is unrolled once. This technique can work well for three-dimensional local operators 
because it promotes register reuse and can increase effective arithmetic intensity, though it 
requires careful register management and instruction scheduling to work effectively. 



for i = 1 to N 
for j = 1 to N 
for k = 1 to N 

A(i,j,k) = R(i,j,k) 
endfor 
endfor 
endfor 



for i = 1 to N step 2 
for j = 1 to N step 2 
for k = 1 to N 
A(i ,j ,k) = R(i ,j ,k) 
A(i+l,j ,k) = R(i+l,j ,k) 
A(i ,j+l,k) = R(i ,j+l,k) 
A(i+l,j + l,k) = R(i+l,j+l,k) 
endfor 
endfor 
endfor 



(a) Regular loop 



(b) Unrolled-and-jammed loop 



Figure 1: Unroll-and-jam example for a three-dimensional array copy operation 



2 Related Work 

Several emerging frameworks are facilitating the development of efficient high performance 
code without having to go down to the assembly level, at least not directly. These frame- 
works are largely motivated by the difficulties involved in utilizing vectorized Floating Point 
Unit (FPU) and other advanced features in the processor. CorePy [30], a Python implemen- 
tation similar to our approach, provides a code synthesis package with an API to develop 
high performance applications by utilizing the low-level features of the processor that are 
usually hidden by the programming languages. Intel has introduced a new dynamic compi- 
lation framework, Array Building Blocks (ArBB) [31J, which represents/enables a high level 
approach to automatically using the SIMD units on Intel processors. In addition, several 
techniques are developed in the literature to address the alignment problems in utilizing the 
SIMD capabilities of modern processors. In [15] . an algorithm is introduced to reorganize 
the data in the registers to satisfy the alignment constraints of the processor. A compila- 
tion technique for data layout transformation is proposed in [20] that reorganizes the data 
statically in memory for minimal alignment conflicts. 

As stencil operators in particular have been identified as an important component of many 
scientific computing applications, a good deal of effort has been spent attempting to improve 
their performance through optimization techniques. Christen et al. optimized a 7-point 
stencil in three-dimensional grids on the CELL BE processor and a GPU system in [9]. On 
CELL BE, they reduced bandwidth requirements through spacial and temporal blocking. To 
improve the computation performance, they utilized the SIMD unit with shuffling intrinsics 
to handle alignment issues. They utilized optimization techniques including preloading, 



loop unrolling, and instruction interleaving. Rivera and Tseng |35J utilized tiling in space 
to improve spatial locality and performance of stencil computations. A domain-specific 
technique is time skewing (28] [26] . Unfortunately, time skewing is not generalizable because 
no other computations between time steps are allowed to occur. Recently, Nguyen et al. 
[32] optimized the performance of a 7-point stencil and a Lattice Boltzmann stencil on 
CPUs and GPUs over three-dimensional grids by performing a combination of spatial and 
temporal blocking, which they dubbed 3.5D blocking, to decrease the memory bandwidth 
requirements of their memory bound problems. Wellein performed temporal blocking on the 
thread level to improve the performance of stencil computations on multicore architectures 
with shared caches in [40J. Kamil conducted a study on the impact of modern memory 
subsystems on three-dimensional stencils [21] and proposed two cache optimization strategies 
for stencil computations in |23J , namely cache-oblivious and cache-aware techniques. Several 
other recent studies in performance improvements in stencil operators, including multilevel 
optimization techniques [HI [33]. 

There are several approaches to obtaining performance improvements in stencil compu- 
tations that are more automated, requiring less manual intervention and resulting in more 
generality: Christen et al. introduced a stencil code generation and auto-tuning framework 
[8], PATUS, targeting CPUs and GPUs. Williams presented an auto-tuning frameworks 
in [12]. They performed their optimization work on a Lattice Boltzmann application over 
several architectures. Kamil built an auto-tuning framework for code generation in [22J. 
Their framework accepts the stencil's kernel in Fortran and then converts it to a tuned ver- 
sion in Fortran, C, or Compute Unified Device Architecture (CUDA). In [37], a software 
synthesis approach was proposed to generate optimized stencil code. Machine Learning 
strategies were proposed in [17] to tune the parameters of the optimization techniques for 
the 7- and the 27-point stencil. Recently, Tang introduced Pochoir stencil compiler in [39J. 
Their framework aims to simplify programming efficient stencil codes, utilizing parallel cache 
oblivious algorithms. 

We call special attention to a comprehensive work by Datta [12], who constructed an 
auto-tuning framework to optimize the 7- and the 27-point stencils and the Gauss-Seidel 
Red-Black Helmholtz kernel. Datta's work was performed on diverse multicore architectures 
modern at the time. He achieved impressive performance by employing a search over a va- 
riety of algorithmic and architecture-targeting techniques including common subexpression 
elimination and Non-Uniform Memory Access (NUMA)-aware allocations. It is interesting to 
note that aside from register blocking and common subexpression elimination, Datta's tech- 
niques were ineffective in improving the performance of stencil operators on the Blue Gene/P 
platform. This was attributed in part to the difficulties in achieving good performance for 
the in-order-execution architecture of the PowerPC 450 processor. 

3 Implementation Considerations 
3.1 Stencil Operators 

A three-dimensional stencil is a linear operator on M> MNP , the space of scalar fields on a 
Cartesian grid of dimension M x N x P. Apart from some remarks in Section [6j we assume 



throughout this paper that the operator does not vary with the location in the grid, as is 
typical for problems with regular mesh spacing and space-invariant physical properties such 
as constant diffusion. The input, A, and output, R, are conventionally stored in a one- 
dimensional array using lexicographic ordering. We choose a C-style ordering convention so 
that an entry ay*, of A has flattened index (iN+j)P + k, with zero-based indexing. Further, 
we assume that the arrays A and R are aligned to 16-byte memory boundaries. 

Formally, the 3-point stencil operator defines a linear mapping from a weighted sum of 
three consecutive elements of A to one element in R: 



We further assume certain symmetries in the stencils that are typical of self-adjoint 
problems, e.g. wo = w^. The effect of this assumption is that fewer registers are required 
for storing the w coefficients of the operator, allowing us to unroll the problem further. This 
assumption is not universally applicable to numerical schemes such as upwinding, where 
adaptive weights are used to capture the direction of flow. 

The 7-point stencil (Figure [2]) defines the result at ry^ as a linear combination of the 
input Oijjfc and its six three-dimensional neighbors with Manhattan distance one. The 27- 
point stencil uses a linear combination of the set of 26 neighbors with Chebyshev distance 
one. The boundary values ryjt with i 6 {0, M — 1}, j 6 {0, iV — 1}, or k G {0, P — 1} are 
not written as is standard for Dirichlet boundary conditions. Other boundary conditions 
would apply a different one-sided stencil at these location, a computationally inexpensive 
consideration that we do not regard for our experiments. 



The 27-point stencil (Figure [3]) can be seen as the summation over nine independent 
3-point stencil operators into a single result. We assume symmetry along but not between 
the three dimensions, leading to 8 unique weight coefficients. The symmetric 7-point stencil 
operator has 4 unique weight coefficients. 



n,j,k = w * a itjtk -x + wi * a i>j>k + w 2 * a iJtk+1 




Figure 2: 7-point stencil operator 



Figure 3: 27-point stencil operator 



3.2 PowerPC 450 

Designed for delivering power-efficient floating point computations, the nodes in a Blue 
Gene/P system are four- way Symmetric Processors (SMPs) comprised of PowerPC 450 pro- 
cessing cores [2T] . The processing cores possess a modest superscalar architecture capable 
of issuing a SIMD floating point instruction in parallel with various integer and load/store 
instructions. Each core has an independent register file containing 32 4-byte general purpose 
registers and 32 16-byte SIMD floating point registers which are operated on by a pair of 
fused floating point units. A multiplexing unit on each end of this chained floating point 
pipeline enables a rich combination of parallel, copy, and cross semantics in the SIMD float- 
ing point operation set. This flexibility in the multiplexing unit can enhance computational 
efficiency by replacing the need for independent copy and swap operations on the floating 
point registers with single operations. To provide backward compatibility as well as some 
additional functionality, these 16-byte SIMD floating point registers are divided into inde- 
pendently addressable, 8-byte primary and secondary registers wherein non-SIMD floating 
point instructions operate transparently on the primary half of each SIMD register. 

An individual PowerPC 450 core has its own 64KB LI cache, divided evenly into a 32KB 
instruction cache and a 32KB data cache. The LI data cache uses a round-robin (First In 
First Out (FIFO)) replacement policy in 16 sets, each with 64- way set-associativity. Each 
LI cache line is 32 bytes in size. 

Every core also has its own private prefetch unit, designated as the L2 cache, "between" 
the LI and the L3. In the default configuration, each PowerPC 450 core can support up to 
5 "deep fetching" streams or up to 7 shallower streams. These values stem from the fact 
that the L2 prefetch cache has 15 128-byte entries. If the system is fetching two lines ahead 
(settings are configured on job startup), each stream occupies three positions, one current 
and two "future," while a shallower prefetch lowers the occupancy to two per stream. The 
final level of cache is the 8MB L3, shared among the four cores. The L3 features a Least 
Recently Used (LRU) replacement policy, with 8-way set associativity and a 128-byte line 
size. 



On this architecture the desired scenario in highly performant numerical codes is the 
dispatch of a SIMD floating point instruction every cycle (in particular, a SIMD Floating 
point Multiply- Add (FMA)), with any load or store involving one of the floating point 
registers issued in parallel, as inputs are streamed in and results are streamed out. Floating 
point instructions can be retired one per cycle, yielding a peak computational throughput 
of one (SIMD) FMA per cycle, leading to a theoretical peak of 3.4 GFlops/s per core. 
Blue Gene/P's floating point load instructions, whether they be SIMD or non-SIMD, can 
be retired every other cycle, leading to an effective read bandwidth to the LI of 8 bytes a 
cycle for aligned 16-byte SIMD loads (non-aligned loads result in a significant performance 
penalty) and 4 bytes a cycle otherwise. As a consequence of the instruction costs, no kernel 
can achieve peak floating point performance if it requires a ratio of load to SIMD floating 
point instructions greater than 0.5. It is important to ensure packed "quad- word" SIMD 
loads occur on 16-byte aligned memory boundaries on the PowerPC 450 to avoid performance 
penalties that ensue from the hardware interrupt that results from misaligned loads or stores. 

An important consideration for achieving high throughput performance on modern float- 
ing point units is pipeline latency, the number of cycles that must transpire between accesses 
to an operand being written or loaded into in order to avoid pipeline hazards (and their con- 
sequent stalls). Floating point computations on the PowerPC 450 have a latency of 5 cycles, 
whereas double-precision loads from the LI require at least 4 cycles and those from the L2 
require approximately 15 cycles [38]. Latency measurements when fulfilling a load request 
from the L3 or DDR memory banks are less precise: in our performance modeling we assume 
an additional 50 cycle average latency penalty for all loads outside the LI that hit in the L3. 

In the event of an LI cache miss, up to three concurrent requests for memory beyond the 
LI can execute (an LI cache miss while 3 requests are "in-flight" will cause a stall until one 
of the requests to the LI has been fulfilled). Without assistance from the L2 cache, this leads 
to a return of 96 bytes (three 32-byte lines) every 56 cycles (50 cycles of memory latency + 
6 cycles of instruction latency), for an effective bandwidth of approximately 1.7 bytes/cycle. 
This architectural characteristic is important in our work, as L3-confined kernels with a 
limited number of streams can effectively utilize the L2 prefetch cache and realize as much 
as 4.5 bytes/cycle bandwidth per core, while those not so constrained will pay the indicated 
bandwidth penalty. 

The PowerPC 450 is an in-order unit with regards to floating point instruction execution. 
An important consequence is that a poorly implemented instruction stream featuring many 
non-interleaved load/store or floating point operations will suffer from frequent structural 
hazard stalls with utilization of only one of the units. Conversely, this in-order nature makes 
the result of efforts to schedule and bundle instructions easier to understand and extend. 

3.3 Instruction Scheduling Optimization 

We wish to minimize the required number of cycles to execute a given code block composed 
of PowerPC 450 assembly instructions. This requires scheduling (reordering) the instructions 
of the code block to avoid the structural and data hazards described in the previous section. 
Although we use greedy heuristics in the current implementation of the code synthesis frame- 
work to schedule instructions, we can formulate the scheduling problem as an Integer Linear 
Programming (ILP) optimization problem. We base our formulation on |7J, which considers 



optimizations combining register allocation and instruction scheduling of architectures with 
multi-issue pipelines. To account for multi-cycle instructions, we include parts of the formu- 
lation in [UJ. We consider two minor extensions to these approaches. First, we consider the 
two separate sets of registers of the PowerPC 450, the General Purpose Registers (GPR), 
and the Floating Point Registers (FPR). Second, we account for instructions that use the 
Load Store Unit (LSU) occupying the pipeline for a varying number of cycles. 

We begin by considering a code block composed of iV instructions initially ordered as 
/ = {ii, I2, 13, In}- A common approach to represent the data dependencies of these 
instructions is to use a Directed Acyclic Graph (DAG). Given a DAG G(V, E), the nodes 
of the graph (V) represent the instructions and the directed edges (E) represent the de- 
pendencies between them. Figure [4] shows an example of a DAG representing a sequence 
of instructions with the edges representing data dependencies between instructions. Each 
read-after-write data dependency of an instruction Ij on an instruction /j is represented by 
a weighted edge e^. This weighted edge corresponds to the number of cycles needed by an 
instruction Ii to produce the results required by an instruction/, . The weights associated 
with write-after-read and write-after-write data dependencies are set to one. 



h 


Id fa,rx,ry 


h 


Id fb,rx,ry 


h 


sub fd, fa, fb 


h 


Id fb, rt, ry 


h 


mul fe, fb, fd 


h 


st fe,rz,ry 


Ir- 


mul fc, fd, fc 


is: 


st fc, rv, ry 



(a) Instruction sequence 




(b) DAG representation 



Figure 4: Example of DAG representation of the instructions 



The PowerPC 450 processor has one LSU and one FPU. This allows the processor to 
execute a maximum of one floating point operation per cycle and one load/store operation 
every two cycles (each operation using the LSU consumes at least 2 cycles). The instructions 
use either the LSU (J i5f/ e LSU) or the other computation resources of the processor 
including the FPU (J FPC/ e FPU). A critical path, C, in the DAG is a path in the graph on 
which the sum of the edge weights attains the maximum. We can compute the lower bound 



(Lbound) to run the instructions I as follows: 



L bound (I) = max{C, 2 * I LSU , I FPU } (1) 

We can compute an upper bound Ub OU nd(I) by simulating the number of cycles to execute 
the scheduled instructions. If Ub OU nd(I) = Lbound{I) we can generate an optimal schedule. 

We define the Boolean optimization variables x 3 i to take the value 1 to represent that the 
instruction Jj is scheduled at a given cycle cP and otherwise. These variables are represented 
in the array shown in Figure [5j where M represents the total number of the cycles. 



h 
h 
h 

In 

Figure 5: A Boolean array of size N x M representing the scheduling variables 

The first constraint in our optimization is to force each instruction to be scheduled only 
once in the code block. Formally: 

M 

524 = 1, Vie {1,2,..., AT} (2) 

J'=l 

The PowerPC 450 processor can execute a maximum of one floating point operation every 
cycle and one load operation every two cycles (store instructions are more complicated, but 
for this derivation we assume two cycles as well). This imposes the following constraints: 

J2 4<h Vje{l,2,...,M} (3) 
J2 (xi+xi +1 )<l,\/je{l,2,...,M-l} (4) 

i:IieI LSU 

Finally, to maintain the correctness of the code's results we enforce the dependency 
constraints: 

AI M 

J2(k * 4) ~ J2( k * x S + ei i + 1 < °' V ^^' : e iJ e E ( 5 ) 

k=l k=l 

The instruction latency to write to a GPR is 1 cycle (e$j = 1). The latency for writing 
to a FPR is 5 cycles for Jj e jFPu an( ^ } eas ^ 4 C y C l es for l i e I LSU . Load instructions 
have higher latency when the loaded data is present in the L3 cache or the RAM. This can 
be considered in future formulations to maximize the number of cycles between loading the 
data of an instruction ij and using it by maximizing in the objective. 
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We also wish to constrain the maximum number of allocated registers in a given code 
block. All the registers are assumed to be allocated and released within the code block. An 
instruction scheduled at a cycle cP allocates a register r by writing on it. The register r is 
released (deallocated) at a cycle c k by the last instruction reading from it. The life span of 
the register r is defined to be from cycle c? to cycle c k inclusive. 

We define two sets of Boolean register optimization variables gf and f- to represent the 
usage of the GPR and the FPR files, respectively. Each of these variables belongs to an 
array of the same size as the array in Figure [H] The value of g\ is set to 1 during the life 
span of a register in the GPR modified by the instruction Jj scheduled at the cycle cP and 
last read by an instruction scheduled at the cycle c k , that is gf = 1 when j < z < k and zero 
otherwise. The same applies to the variables f- to represent the FPR register allocation. 

To compute the values of g\ and f-, we define the temporary variables g\ and ff. Let 
Ki be the number of instructions reading the from the register allocated by the instruction 
ij. These temporary variables are computed as follows: 

p p 

fp = k xj2 x t - Yl E ^ Wi : li writes on FPR ( 6 ) 

2=1 Vj:eij£E 2=1 

P P 

gf = K xJ2x Z i - E Vi : J * writes 071 GPR ( 7 ) 

2=1 Vj:e i:j eE 2=1 

Our optimization variables g\ and // will equal to 1 only when gj > and // > 0, 
respectively. This can be formulated as follow: 

//-//< (8) 

gi-g{<o (9) 
Kxfi-fi>o (10) 

Kxgl-gj>0 (11) 

Now we can constrain the maximum number of used registers FPR max and GPR max by 
the following: 

J2(9l)-GPR max <0, Vje{l,2,...,M} (12) 

i=i 

Af 

J2(fD - FPRmax < 0, Vje{l,2,...,M} (13) 

1=1 



Our optimization objective is to minimize the required cycles to execute the code (C run ). 
The complete formulation of the optimization problem is: 



Minimize: C run (14) 



M 



Subject to: J^xJ = 1, Vz 6 {1, 2, iV} @ 



^ <1, Vje{l,2,..,M} @ 

i:IieI FPU 



^ (^+^+ 1 )<l, VjG{l,2,..,M-l} Q 

M M 

^(fc * x*) - J^(A; * a;*) + ^ + 1 < 0, Vi, j : ^ eE @ 
fc=i fc=i 

JV 

J2(<d)-GPRma X <0, Vje{l,2,..,M} Q) 

i=l 
JV 

^(//)-FP J R ma:c <0, VjG{l,2,...,M} @ 



JM 



2_.j * x i ~ C run < 0, Vz : Jj has no successors (sink node) (15) 

3=1 



The general form of this optimization problem is known to be NP-complete [19] , although 
many efficient implementations exist for solving integer problems exist; no known algorithms 
can guarantee global solutions in polynomial time. In our implementation, we use a greedy 



algorithm that yields an approximate solution in practical time as we describe in 4.4 



4 Implementation 

4.1 C and Fortran 

We implement the three streaming kernels in C and Fortran and utilize published results 
as a benchmark to gain a better understanding of the relative performance of the general 
class of streaming numerical kernels on the IBM PowerPC 450. Our first challenge is in 
expressing a SIMDized mapping of the stencil operator in C or Fortran. Neither language 
natively supports the concept of a unit of SIMD-packed doubles, though Fortran's complex 
type comes very close. Complicating matters further, the odd cardinality of the stencil 
operators necessitates careful tactics to efficiently SIMDize the code. Regardless of the 
unrolling strategy employed, one of every two loads is unaligned or requires clever register 
manipulation. It is our experience that standard C and Fortran implementations of our 
stencil operators will compile into exclusively scalar instructions that cannot attain greater 
than half of peak floating point performance. For example, with a naive count of 53 flops 
per 27-point stencil, peak performance is 62 Mstencil/s and we observe 31.5 Mstencil/s 
on a single core of the PowerPC 450. Register pressure and pipeline stalls further reduce 



the true performance to 21.5 Mstencil/s. We note that Datta [TT] was able to improve 
this to 25 Mstencil/s using manual unrolling, common subexpression elimination, and other 
optimizations within C. 

The XL compilers support intrinsics which permit the programmer to specify which 
instructions are used, but the correspondence between intrinsics and instructions is not 
exact enough for our purposes. The use of intrinsics can aid programmer productivity, as 
this method relies upon the compiler for such tasks as register allocation and instruction 
scheduling. However, our methods require precise control of these aspects of the produced 
assembly code, making this path unsuitable for our needs. 

Our code generation framework is shown in Figure |6j High-level Python code produces 
the instruction sequence of each code block. First, registers are allocated by assigning them 
variable names. Next, a list of instruction objects is created that utilize the variables to 
address registers. Enabling the instruction scheduler allows the simulator to execute the 
instructions out of order; otherwise they will be executed in order. The instruction simulator 
uses virtual GPR, FPR, and memory to simulate the pipeline execution and to simulate the 
expected results of the given instruction sequence. A log is produced by the instruction 
simulator to provide the simulation details, showing the cycles at which the instructions are 
scheduled and any encountered data and structural hazards. Also, the log can contain the 
contents of the GPR, the FPR, and the memory, at any cycle, to debug the code for results 
correctness. The C code generator takes the simulated instructions and generates their 
equivalent inline assembly code in a provided C code template using the inline assembly 
extension to C standard provided by GCC. For added clarity, we associate each generated 
line with a comment showing the mapping between the used registers numbers and their 
corresponding variables names in the Python code. 
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Figure 6: The components of the code generation framework in this work 



4.2 Kernel Design 



The challenges in writing fast kernels in C and Fortran motivate us to program at the assem- 
bly level, a (perhaps surprisingly) productive task when using our code synthesis framework: 
an experienced user was able to design, implement, and test several efficient kernels for a new 
stencil operator in one day using the framework. Much of our work is based on the design of 
two small and naively scheduled, 3-point kernels, dubbed mutate-mutate and load-copy, that 



we introduce in The kernels distinguish themselves from each other by their relative 
balance between load/store and floating point cycles consumed per stencil. 

We find that efficiently utilizing SIMD units in stencil computations is a challenging task. 
To fill the SIMD registers, we pack two consecutive data elements from the fastest moving 
dimension, k, allowing us to compute two stencils simultaneously, as in 0. Computing in this 
manner is semantically equivalent to an unrolling by 2 in the k direction. As a connventional 
notation, since % and j are static for any given stream, we denote the two values occupying 
the SIMD register for a given array by their k indices, e.g., SIMD register contains Aij^ 
in its primary half and Ai j4 in its secondary half. 

Many stencil operators map a subset of adjacent A elements with odd cardinality to each 
R element, as is illustrated in the left half of Figure [7j which depicts the SIMD register 
contents and computations mid-stream of a 3-point kernel. The odd cardinality prevents 
a straightforward mapping from SIMD input registers to SIMD output registers. We note 
that aligned SIMD loads from A easily allow for the initialization of registers containing (Z23 
and 045. Similarly, the results in r 34 can be safely stored using a SIMD store. The register 
containing a 34 , unaligned elements common to the aligned registers containing adjacent data, 
requires a shuffle within the SIMD registers through the use of additional load or floating 
point move instructions. 

We introduce two kernels, which we designate as mutate-mutate (mm) and load-copy 
(lc), as two different approaches to form the packed data into the unaligned SIMD registers 
while streaming through A in memory. A "mutate" operation is defined as the replacment 
of one operand of a SIMD register by a value loaded from memory. A "load" operation 
is defined as the replacement of both operands in a SIMD register by a SIMD load from 
memory. A "copy" operation is the replacement of one operand of a SIMD register by an 
operand from another SIMD register. Mutates and loads utilize the LSU, copies utilize the 
FPU. 

The mutate-mutate kernel replaces the older half of the SIMD register by with the next 
element of the stream. In our example in Figure [7] we start with 023 loaded in the SIMD 
register, then after the first computation we load a 4 into the primary part of the SIMD 
register so that the full contents are a 43 . 

The load-copy kernel instead combines the unaligned values in a SIMD register from 
two consecutive aligned quad-words loaded in two SIMD registers by copying, the primary 
element from the second register to the primary element of the first. In our example 023 and 
a 4 5 are loaded in two SIMD registers. Then, after the needed computations involving 023 
have been dispatched, a floating point move instruction replaces a 2 with a 4 to form a 43 . 

The two kernels use an identical set of floating point instructions, visually depicted in 
the right half of Figure [7j to accumulate the computations into the result registers. The 
two needed weight coefficients are packed into one SIMD register. The first floating point 
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(b) Compute contributions from 043 
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(c) Compute contributions from 045 
Figure 7: SIMD stencil computations on one-dimensional streams 



operation is a cross copy-primary multiply instruction, multiplying two copies of the first 
weight coefficient by the two values in 023, then placing the results in the SIMD register 



containing (Figure 7a). Then, the value of 023 in the SIMD register is modified to 
become 043, replacing one data element in the SIMD register either through mutate or copy. 
The second floating point operation is a cross complex multiply-add instruction, performing 



a cross operation to deal with the reversed values (Figure 7b). Finally, the value a 45 , which 
has either been preloaded by load-copy or is created by a second mutation in mutate-mutate, 



is used to perform the last computation (Figure 7c). 

We list the resource requirements of the load-copy and the mutate-mutate kernels in 
Table [TJ The two kernels are at complementary ends of a spectrum. The mutate-mutate 
kernel increases pressure exclusively on the load pipeline while the load-copy kernel incurs 
extra cycles on the floating point unit. The two strategies can be used in concert, using 
mutate-mutate when the floating point unit is the bottleneck and the load-copy when it is 
not. 



Table 1: Resource usage per stencil of mutate- mutate and load-copy 
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4.3 Unroll- and- Jam 

The 3-point kernels are relatively easy to specify in assembly, but the floating point and 
load/store instruction latencies will cause pipeline stalls if they are not unrolled. Further 
unrolling in the /c-direction beyond two is a possible solution that we do not explore in this 
paper. Although this solution would reduce the number of concurrent memory streams, it 
would also reduce data reuse for the other stencils studied in this paper. 

Unrolling and jamming once in transverse directions provides independent arithmetic 
operations to hide instruction latency, but interleaving the instructions by hand produces 
large kernels that are difficult to understand and modify. To simplify the design process, we 
constructed a synthetic code generator and simulator with reordering capability to interleave 
the jammed FPU and load/store instructions to minimize pipeline stalls. The synthetic code 
generator also gives us the flexibility to implement many general stencil operators, including 
the 7-point and 27-point stencils using the 3-point stencil as a building block. 

Unroll- an d-j am serves a second purpose for the 7-point and 27-point stencil operators due 
to the overlapped data usage among adjacent stencils. Unroll- and-j am eliminates redundant 
loads of common data elements among the jammed stencils, reducing pressure on the mem- 
ory subsystem by increasing the effective arithmetic intensity of the kernel. This can be 
quantified by comparing the number of input streams, which we refer to as the "frame size," 
with the number of output streams. For example, an unjammed 27-point stencil requires a 
frame size of 9 input streams for a single output stream. If we unroll once in i and j, we 
generate a 2 x 2 jam with 4 output streams using a frame size of 16, improving the effective 
arithmetic intensity by a factor of |. 

We used mutate-mutate and load-copy kernels to construct 3-, 7-, and 27-point stencil 
kernels over several different unrolling configurations. Table [2] lists the register allocation 
requirements for these configurations and provides per-cycle computational requirements. 
In both the mutate-mutate and load-copy kernels, the 27-point stencil is theoretically FPU- 
bound because of the high reuse of loaded input data elements across streams. 

As can be seen in Table[TJ the mutate-mutate kernel allows more unrolling for the 27-point 
stencil than the load-copy kernel because it uses fewer registers per stencil. The number of 
allocated registers for input data streams at the mutate-mutate kernel is equal to the number 
of the input data streams, while the load-copy kernel requires twice the number of registers 
for its input data streams. 

4.4 PowerPC 450 Simulator 

The high-level code synthesis technique generates as many as hundreds of assembly instruc- 
tions with aggressive unrolling. For optimal performance, we greedily schedule the assembly 



Table 2: Computational requirements 
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instructions for in-order execution using the constraints outlined in 15 First, we produce 



a list of non-redundant instructions using a Python generator. The simulator reflects an 
understanding of the constraints by modeling the instruction set, including semantics such 
as read and write dependencies, instruction latency, and which execution unit it occupies. 
It functions as if it were a PowerPC 450 with an infinite-lookahead, greedy, out-of-order 
execution unit. On each cycle, it attempts to start an instruction on both the load/store 
and floating point execution units while observing instruction dependencies. If this is not 
possible, it provides diagnostics about the size of the stall and what dependencies prevented 
another instruction from being scheduled. The simulator both modifies internal registers 
that can be inspected for verification purposes and produces a log of the reordered instruc- 
tion schedule. The log is then rendered as inline assembly code which can be compiled using 
the XL or GNU C compilers. 



5 Performance 

Code synthesis allows us to easily generate and verify the performance of 3-, 7-, and 27-point 
stencil operators against our predictive models over a range of unrolling- an d-jamming and 
inner kernel options. We use individual MPI processes mapped to the four PowerPC 450 
cores to provide 4-way parallelization and ascertain performance characteristics out to the 
shared L3 and DDR memory banks. 

The generated assembly code, comprising the innermost loop, incurs a constant compu- 
tational overhead from the prologue, epilogue, and registers saving/restoring. The relative 
significance of this overhead is reduced when larger computations are performed at the 
innermost loop. This motivated us to decompose the problem's domain among the four 
cores along the outermost dimension for the 3- and 7-point stencils, resulting in better per- 
formance. However, the 27-point stencil has another important property, dominating the 
innermost loop overhead cost. Its computations inhibit relatively high input data points 
sharing among neighbor stencils. Splitting the innermost dimension allows the shared input 
data points to be reused by consecutive middle dimension iterations, where the processor 
will likely have them in the LI cache, resulting in faster computations. Conversely, if the 
computation is performed using a large innermost dimension, the shared input data points 
will no longer be in the LI cache at the beginning of the next iteration of the middle loop. 

All three studies were conducted over a range of cubic problems from size 14 3 to 362 3 . 
The problem sizes were chosen such that all variations of the kernels can be evaluated naively 
without extra code for cleanup. We computed each sample in a nested loop to reduce noise 
from startup costs and timer resolution, then select the highest performing average from 
the inner loop samples. There was almost no noticeable jitter in sample times after the 
first several measurements. All performance results are given per-core, though results were 
computed on an entire node with a shared L3 cache. Thus, full node performance can be 
obtained by multiplying by four. 

During the course of our experiments on the stencils, we noticed performance problems 
for many of the stencil variants when loading data from the L3 cache. The large number of 
concurrent hardware streams in the unroll- and-j am approach overwhelms the L2 streaming 
unit, degrading performance. This effect can be amplified in the default optimistic prefetch 



mode for the L2, causing wasted memory traffic from the L3. We made use of a boot option 
that disables optimistic prefetch from the L2 and compare against the default mode where 
applicable in our results. We distinguish the two modes by using solid lines to indicate 
performance results obtained in the default mode and dashed lines to indicate results where 
the optimistic prefetch in L2 has been disabled. 



5.1 3-Point Stencil Computations 

We begin our experiments with the 3-point stencil (Figure [8]), the computational building 
block for the other experiments in our work. For a more accurate streaming bandwidth 
peak estimation, we considers 3.7 bytes/s read, from the DRAM, and 5.3 bytes/s write 
bandwidth at the stencil computations. Also, we compute the L3 bandwidth peak using 
4.7 bytes/s read and 5.3 bytes/s write bandwidth at the stencil computations. The 3-point 
stencil has the lowest arithmetic intensity of the three stencils studied, and unlike its 7-point 
and 27-point cousins, does not see an increase in effective arithmetic intensity when unroll- 



and-jam is employed. It is clear from Section 4.2 that the load-copy kernel is more efficient 



in bandwidth-bound situations, so we use it as the basis for our unroll- and-j am experiments. 
We see the strongest performance in the three problems that fit partially in the LI cache 
(the peak of 224 Mstencil/s is observed at 26 3 ), with a drastic drop off as the problem inputs 
transition to the L3. The most robust kernel is the 2x1 jam, which reads and writes to two 
streams simultaneously, and can therefore engage the L2 prefetch unit most effectively. The 
larger unrolls (2x2, 2x3, and 2x4), enjoy greater performance in and near the LI, but then 
suffer drastic performance penalties as they exit the LI and yet another performance dip 
near 250 3 . Disabling optimistic L2 prefetch does not seem to have any large effect on the 
2x1 kernel, though it unreliably helps or hinders the other kernels. 

The 3-point kernel seems to be an ideal target on the PowerPC 450 for standard unrolling 
in the fastest moving dimension, k, a technique we did not attempt due to its limited appli- 
cation to the larger problems we studied. Unroll- and-j am at sufficient sizes to properly cover 
pipeline hazards overwhelms the L2 streaming unit due to the large number of simultaneous 
streams to memory. Unrolling in k would cover these pipeline hazards without increasing 
the number of streams. 



5.2 7-Point Stencil Computations 

Our next experiment focuses on the performance of the 7-point stencil operator (Figure [9]). 
We compare the mutate-mutate and load-copy kernels using the same unroll configurations. 
We note that the mutate-mutate kernel can support a slightly more aggressive unroll-and- 
jam on this problem with a compressed usage of general purpose registers that was only 
implemented for the 27-point stencil. 

Once again we notice strong performance within the LI, then a dropoff as the loads start 
coming from the L3 instead of the LI. The performance drop near 256 3 is caused when the 
2x3 kernel's frame size of (2 + 2) (3 + 2) — 4 = 16 multiplied by the length of the domain 
exceeds the size of LI. For smaller sizes, neighbors in the j direction can reside in LI between 
consecutive passes so that only part of the input frame needs to be supplied by streams from 
memory. With up to 16 input streams and 2-3 = 6 output streams, there is no hope of 
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Figure 8: 3-point stencil performance with load-copy kernel at various unroll- and-jams (ixj). 
L3 bandwidth peak is 265 Mstencil/s 



effectively using the L2 prefetch unit. The load-copy kernel shows better performance than 
the mutate-mutate kernel, as it is clear here that load/store cycles are more constrained than 
floating point cycles. We also notice that performance of the load-copy kernel improves with 
the L2 optimistic prefetch disabled slightly within the L3, and drastically when going to the 
DDR banks. This is likely due to the kernel's improved performance, and therefore increased 
sensitivity with regards to latency from the memory subsystem. It is likely that the 7-point 
stencil could attain better results by incorporating cache tiling strategies, though we note 
that without any attempts at cache tiling the performance of this result is commensurate with 
previously reported results for the PowerPC 450 that focused on cache tiling for performance 
tuning. 



5.3 27-Point Stencil Computations 

The 27-point stencil should be amenable to using a large number of jammed unrolls due to 
the high level of reuse between neighboring stencils. Indeed, we see nearly perfect scaling in 
Figure 10 as we increase the number of jams from 1 to 6 using the mutate-mutate kernel. 
Although there is a gradual drop off from the peak of 54 Mstencil/s (85% of arithmetic peak) 
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Figure 9: 7-point stencil performance comparison between mutate- mutate (mm) and load- 
copy (lc) kernels at a fixed unroll- an d-j am (i=2 and j=3). Experiments with L2 optimistic 
prefetch ends with "o" 

as the problem sizes increase to the point that there is little reuse from the LI cache, the 
kernel consistently sustains an average of 45 Mstencil/s (70% of arithmetic peak), even when 
the problem sizes greatly exceed the L3 cache. 

Despite the L2-overwhelming frame size ((2 + 2) (3 + 2) = 20 input streams and 2-3 = 6 
output streams), the jammed stencil achieves good performance with no blocking largely due 
to the high level of reuse of input data afforded by the unrolls in i and j. 

5.4 Model Validation 

As we utilized a simulator which incorporates a model of the architecture's performance 
characteristics to produce our kernels, we sought to validate our performance model by 
comparing the implicit predictions of our generative system to the empirical results seen in 
Table 1 

Since performance within a core is often considerably easier to predict than when one 
must go beyond the core for memory access, we divide our comparisons into those on-core 
(LI) and those that go off the core, to L3 or main memory (streaming). 



70 








50 



100 



150 200 250 
Input length 



300 



350 



400 



Figure 10: 27-point stencil performance with mutate-mutate kernel at various unroll- and- 
jams (ixj). Best L3 bandwidth peak is 118 Mstencil/s and best streaming bandwidth peak 
is 97.5 Mstencil/s, both corresponds to 2x3 unroll- and-j am 



Our modeling of the 27-point stencils can be seen to be highly accurate in Table [3] 
Inside the LI cache the disparity between predicted and actual performance is consistently 
less than 1%. Shifting our attention to the streaming predictions, our accuracy can be seen 
to be considerably degraded. This is not surprising, given that our simulator was largely 
targeted to model the LI domain. However, the relative error is less than 15% in all cases; 
this appears to be sufficient for producing highly efficient code. This shortcoming appears 
to stem directly from the level of detail with which we model the shared L3 cache and main 
memory subsystem and we are working to address this in our simulator. 

The match between predicted and witnessed performance for the 7-point stencil shows 
the same pattern. When modeling performance inside the LI our relative error is less than 
10%, but when extending our prediction to the components of the system shared by all 
four cores, our error is as great as 17.5%. Our greatest error in this instance is an under- 
prediction that is probably attributable to a fortuitous alignment of the continuous vectors 
in the k-direction, as staggering these carefully often result in bandwidth benefits on the 
order of 10-15%. 



Finally, we assess our model for the 3-point stencil. Again we see good agreement between 
the model and observed performance within the LI, though prediction accuracy degrades for 
problem sizes requiring streaming. From some further experimentation, we are reasonably 
certain that the chief reason for our lack of accuracy in predicting performance outside of 
the LI stems from the bandwidth that must be shared between the multiple write streams 
and our failure to account for this in our model. It is most apparent in the 3-point stencil 
predictions as the ratio of write streams to either read streams or floating point operations 
is highest in this case. 

6 Concluding Remarks 

6.1 Conclusion 

The main contribution of this work is effective register and instruction scheduling for constant 
coefficient linear operators on power-efficient processors. The loads of the input vector 
elements and stores of the output vector elements are minimized and the fraction of multiply- 
adds among all cycles is maximized. This is achieved by using two novel 3-point stream- 
computation sub-kernels designed for the PowerPC 450's instruction set, mutate-mutate 
and load-copy. Both kernels were possible without data layout reordering because of the 
extensively multiplexed SIMD-like floating point units implemented in the PowerPC 450 
core. 

Recommendations for the research agenda for computational software libraries in the 
exascale domain include the the fusion of library routine implementations as well as the 
creation of frameworks that enable the optimal instantiation of a given routine when supplied 
with architectural information [13J. We feel that the work presented here is a contribution 
to that end. Further, the nature of our simulator allows us to optimize our code as measured 
by other metrics such as bandwidth or energy consumption, given a simple model of the 
cost. We also demonstrate performance comparable to advanced cache tiling approaches for 
the 7-point stencil, despite the fact that we make no effort to optimize for cache reuse. 

6.2 Future Work 

While the three problems considered (3-point stencils in one dimension and 7-point and 
27-point stencils in three dimensions, with constant coefficients and symmetry within each 
spatial dimension, but not across them) are heavily used in applications, there are numerous 
generalizations. The suitability of our approach can be characterized by the arithmetic 
intensity associated with each generalization. We elaborate on two that tend to increase 
the arithmetic intensity, higher-order stencils and chained iterative passes over the vectors, 
and two that tend to decrease arithmetic intensity, irregular stencils and spatial varying 
coefficients. 

Higher-order stencils expand the number of adjacent input vector elements that enter into 
a single output vector element, in successive steps of semi-width one in each of the spatial 
dimensions i, j, and k, with an additional weight coefficient corresponding to each additional 
increment of semi-width in each dimension. This is a modest generalization. Higher-order 



Table 3: Predictions vs. observations for in-Ll and streaming performance, in Megastencils 
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discretization increases register pressure because of the larger number of inputs that combine 
in each output. Opportunities for reuse of input elements expand with the stencil width up to 
the ability to keep them resident. In a P-point regular stencil (regardless of number of spatial 
dimensions) each input element is operated upon with a pre-stored weight P times: once in 
the stencil centered upon it, and once in each neighboring stencil with which its own stencil 
overlaps. Floating point arithmetic intensity increases in proportion to P. That is, if there 
are N elements in the input or output array, there are PN floating point multiply-adds per TV 
floating reads and N floating writes. Explicit methods for nonlinear systems, especially with 
high-order discretization techniques such as Weighted Essentially Non-Oscillatory (WENO) 
or discontinuous Galerkin [36], have similar properties, including a larger number of input 
streams, but with much higher arithmetic intensity. 

S'-stage chaining (as in the simultaneous accumulation of Ax, A 2 x, A 3 x, . . . A s x) allows 
the output vector to be fed back as input before being written. Per output vector of iV 
floating point writes, there are N/S reads and PN floating point multiply-adds. Therefore, 
up to the ability to keep the additional operands cached, both higher-order operators and 
chained operations improve the potential for the transformations described here. 

Irregular stencils require integer reads, in addition to floating point reads to determine 
which elements of the input vector go with each row of the matrix. This further dilutes ad- 
vantages that lead to the great breakthroughs in stencils per second described here. Stencil 
operations with constant coefficients and sparse matrix-vector multiplies with general coef- 
ficients are similar when counting floating operations, but very different when it comes to 
data volume. 

Spatially varying coefficients require the loading of additional weights, each of which is 
used only once, each of which is of the same floating precision of the input and output vectors, 
P of them in the production of each output vector element, with each input vector element 
being combined with P different weights. While each input and output element can still be 
reused up to P times in the execution of one pass through the overall vector-to-vector map, 
the dominant array in the workspace is the coefficient matrix of weights so the benefits of 
reusing the vectors are minimal. This situation is typical for nonlinear problems when using 
Newton-Krylov and linear multigrid methods. However, when "free flops" are available, the 
weights can also be recomputed on the fly as a nonlinear function of a given state and/or 
scalar coefficients. In this case, the number of input streams is similar to the linear constant 
coefficient case (perhaps larger by a factor of 2 or 3), but the number of floating point results 
is several times higher and involves the problem-specific "physics." Putting the physics inside 
the kernel like this suggests that there will be an emphasis on the ability to quickly develop 
high-performance kernels for new physics. 
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